
    Humans and deep networks largely agree on which kinds of variation make object recognition harder

    View-invariant object recognition is a challenging problem that has attracted much attention in the psychology, neuroscience, and computer vision communities. Humans are remarkably good at it, even though some variations are presumably harder to handle than others (e.g. 3D rotations). Humans are thought to solve the problem through hierarchical processing along the ventral stream, which progressively extracts more and more invariant visual features. This feed-forward architecture has inspired a new generation of bio-inspired computer vision systems called deep convolutional neural networks (DCNNs), which are currently the best algorithms for object recognition in natural images. Here, for the first time, we systematically compared human feed-forward vision and DCNNs at view-invariant object recognition using the same images, controlling for both the kind of transformation and its magnitude. We used four object categories, and images were rendered from 3D computer models. In total, 89 human subjects participated in 10 experiments in which they had to discriminate between two or four categories after rapid presentation with backward masking. We also tested two recent DCNNs on the same tasks. We found that humans and DCNNs largely agreed on the relative difficulty of each kind of variation: rotation in depth is by far the hardest transformation to handle, followed by scale, then rotation in plane, and finally position. This suggests that humans recognize objects mainly through 2D template matching rather than by constructing 3D object models, and that DCNNs are not unreasonable models of human feed-forward vision. Our results also show that the variation levels in rotation in depth and scale strongly modulate both humans' and DCNNs' recognition performance. We therefore argue that these variations should be controlled for in the image datasets used in vision research.
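    As a concrete illustration of this kind of comparison, the hypothetical PyTorch sketch below scores a network's accuracy separately for each kind of variation; `loaders_by_variation` is an assumed mapping from a variation name ("position", "scale", "rotation_in_plane", "rotation_in_depth") to a DataLoader of (image, label) pairs rendered at that variation level. The paper's actual networks, stimuli, and category read-out are not reproduced here.

```python
import torch

def accuracy_per_variation(model, loaders_by_variation, device="cpu"):
    """Return classification accuracy separately for each kind of image variation.

    `loaders_by_variation` is a hypothetical dict: variation name -> DataLoader
    of (image, label) batches whose labels match the model's output classes.
    """
    model.eval().to(device)
    results = {}
    for name, loader in loaders_by_variation.items():
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in loader:
                preds = model(images.to(device)).argmax(dim=1).cpu()
                correct += (preds == labels).sum().item()
                total += labels.numel()
        results[name] = correct / max(total, 1)
    # e.g. {"rotation_in_depth": 0.62, "scale": 0.78, "rotation_in_plane": 0.88, "position": 0.95}
    return results
```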

    Learning Mechanisms to Account for the Speed, Selectivity and Invariance of Responses in the Visual Cortex

    In this thesis I propose various activity-driven synaptic plasticity mechanisms that could account for the speed, selectivity and invariance of neuronal responses in the visual cortex, and I discuss their biological plausibility. I also present the results of a relevant psychophysical experiment demonstrating that familiarity can accelerate visual processing. Beyond these results on the visual system, the studies presented here also support the hypothesis that the brain uses spike times to encode, decode, and process information, a theory referred to as 'temporal coding'. In such a framework, Spike Timing Dependent Plasticity (STDP) may play a key role by detecting repeating spike patterns and by generating faster and faster responses to those patterns.
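    The thesis's own models are not reproduced here, but a minimal, generic pair-based STDP sketch illustrates the kind of mechanism referred to: a synapse is strengthened when a presynaptic spike shortly precedes a postsynaptic one and weakened otherwise, so that a neuron exposed to a repeating spike pattern ends up responding to it earlier and earlier. All constants and names below are illustrative assumptions.

```python
import numpy as np

def stdp_update(w, dt, a_plus=0.01, a_minus=0.012, tau=20.0, w_max=1.0):
    """Pair-based STDP: dt = t_post - t_pre (ms).
    Pre-before-post (dt > 0) potentiates; post-before-pre depresses."""
    if dt > 0:
        w += a_plus * np.exp(-dt / tau)
    else:
        w -= a_minus * np.exp(dt / tau)
    return float(np.clip(w, 0.0, w_max))

# Example: a synapse whose presynaptic spike repeatedly arrives 5 ms before
# the postsynaptic spike grows stronger over presentations.
w = 0.5
for _ in range(50):
    w = stdp_update(w, dt=5.0)
print(round(w, 3))
```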

    Learning Delays in Spiking Neural Networks using Dilated Convolutions with Learnable Spacings

    Spiking Neural Networks (SNNs) are a promising research direction for building power-efficient information processing systems, especially for temporal tasks such as speech recognition. In SNNs, delays refer to the time needed for one spike to travel from one neuron to another. These delays matter because they influence spike arrival times, and it is well known that spiking neurons respond more strongly to coincident input spikes. More formally, it has been shown theoretically that plastic delays greatly increase the expressivity of SNNs. Yet efficient algorithms to learn these delays have been lacking. Here, we propose a new discrete-time algorithm that addresses this issue in deep feedforward SNNs using backpropagation, in an offline manner. To simulate delays between consecutive layers, we use 1D convolutions across time. The kernels contain only a few non-zero weights, one per synapse, whose positions correspond to the delays. These positions are learned together with the weights using the recently proposed Dilated Convolution with Learnable Spacings (DCLS). We evaluated our method on three benchmarks that require detecting temporal patterns: the Spiking Heidelberg Dataset (SHD), the Spiking Speech Commands (SSC), and its non-spiking version, Google Speech Commands v0.02 (GSC). We used feedforward SNNs with two or three hidden fully connected layers and vanilla leaky integrate-and-fire neurons. We showed that fixed random delays help and that learning them helps even more. Furthermore, our method outperformed the state of the art on all three datasets without using recurrent connections and with substantially fewer parameters. Our work demonstrates the potential of delay learning for developing accurate and precise models of temporal data processing. Our code is based on PyTorch / SpikingJelly and is available at: https://github.com/Thvnvtos/SNN-delay
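    To make the delay-as-convolution idea concrete, the sketch below shows why a per-synapse delay is equivalent to a 1D temporal convolution whose kernel has a single non-zero tap. The delays here are fixed integers for clarity; the method described above makes the tap positions learnable via DCLS, which this simplified sketch does not implement.

```python
import torch
import torch.nn.functional as F

def delay_spikes(spikes, delays, max_delay):
    """spikes: (batch, channels, time) spike trains.
    delays: (channels,) integer delay (in time steps) per channel."""
    b, c, t = spikes.shape
    kernels = torch.zeros(c, 1, max_delay + 1)
    # conv1d is a cross-correlation, so a delay of D corresponds to a single
    # non-zero tap at index (max_delay - D) once the input is left-padded.
    kernels[torch.arange(c), 0, max_delay - delays] = 1.0
    padded = F.pad(spikes, (max_delay, 0))          # causal left-padding
    return F.conv1d(padded, kernels, groups=c)      # each channel shifted by its delay

spikes = torch.zeros(1, 2, 10)
spikes[0, :, 2] = 1.0                               # both channels spike at t = 2
out = delay_spikes(spikes, torch.tensor([0, 3]), max_delay=4)
print(out[0, 0].nonzero().flatten(), out[0, 1].nonzero().flatten())  # t=2 and t=5
```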

    Dilated convolution with learnable spacings

    Recent works indicate that convolutional neural networks (CNNs) need large receptive fields (RFs) to compete with vision transformers and their attention mechanism. In CNNs, RFs can simply be enlarged by increasing the convolution kernel sizes. Yet the number of trainable parameters, which scales quadratically with the kernel size in the 2D case, rapidly becomes prohibitive, and training is notoriously difficult. This paper presents a new method to increase the RF size without increasing the number of parameters. The dilated convolution (DC) has already been proposed for the same purpose. DC can be seen as a convolution with a kernel that contains only a few non-zero elements placed on a regular grid. Here we present a new version of the DC in which the spacings between the non-zero elements, or equivalently their positions, are no longer fixed but learnable via backpropagation thanks to an interpolation technique. We call this method "Dilated Convolution with Learnable Spacings" (DCLS) and generalize it to the n-dimensional convolution case, although our main focus here is the 2D case. We first tried our approach on ResNet50: we drop-in replaced the standard convolutions with DCLS ones, which increased the accuracy of ImageNet1k classification at iso-parameters, but at the expense of throughput. Next, we used the recent ConvNeXt state-of-the-art convolutional architecture and drop-in replaced the depthwise convolutions with DCLS ones. This increased the accuracy not only of ImageNet1k classification but also of typical downstream and robustness tasks, again at iso-parameters, but this time with negligible cost on throughput, as ConvNeXt uses separable convolutions. Conversely, classic DC led to poor performance with both ResNet50 and ConvNeXt. The code of the method is available at: https://github.com/K-H-Ismail/Dilated-Convolution-with-Learnable-Spacings-PyTorch
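    The sketch below is an illustrative reimplementation of the core DCLS idea rather than the official package linked above: each kernel element has a learnable weight and a learnable continuous 2D position, and the element is spread over its four nearest grid cells with bilinear weights, so the position receives well-defined gradients. Class and parameter names are invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDcls2d(nn.Module):
    """Depthwise conv with a few kernel elements at learnable continuous positions."""
    def __init__(self, channels, n_elements=3, kernel_size=7):
        super().__init__()
        self.k = kernel_size
        self.weight = nn.Parameter(torch.randn(channels, n_elements) * 0.1)
        # Continuous positions inside the kernel_size x kernel_size grid.
        self.pos = nn.Parameter(torch.rand(channels, n_elements, 2) * (kernel_size - 1))

    def dense_kernel(self):
        c, n = self.weight.shape
        p = self.pos.clamp(0, self.k - 1 - 1e-4)
        x0, y0 = p[..., 0].floor(), p[..., 1].floor()
        fx, fy = p[..., 0] - x0, p[..., 1] - y0        # bilinear fractions (carry the gradient)
        kernel = self.weight.new_zeros(c, self.k * self.k)
        for dx, dy, frac in [(0, 0, (1 - fx) * (1 - fy)), (1, 0, fx * (1 - fy)),
                             (0, 1, (1 - fx) * fy),       (1, 1, fx * fy)]:
            idx = ((x0 + dx).clamp(max=self.k - 1) * self.k
                   + (y0 + dy).clamp(max=self.k - 1)).long()
            kernel = kernel.scatter_add(1, idx, self.weight * frac)
        return kernel.view(c, 1, self.k, self.k)       # one depthwise kernel per channel

    def forward(self, x):
        return F.conv2d(x, self.dense_kernel(), padding=self.k // 2, groups=x.size(1))

layer = TinyDcls2d(channels=64)
out = layer(torch.randn(2, 64, 32, 32))   # spatial size preserved by padding
print(out.shape)
```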

    Dilated Convolution with Learnable Spacings: beyond bilinear interpolation

    Dilated Convolution with Learnable Spacings (DCLS) is a recently proposed variation of the dilated convolution in which the spacings between the non-zero elements in the kernel, or equivalently their positions, are learnable. Non-integer positions are handled via interpolation; thanks to this trick, positions have well-defined gradients. The original DCLS used bilinear interpolation and thus only considered the four nearest pixels. Yet here we show that longer-range interpolations, and in particular a Gaussian interpolation, improve performance on ImageNet1k classification with two state-of-the-art convolutional architectures (ConvNeXt and ConvFormer), without increasing the number of parameters. The method code is based on PyTorch and is available at https://github.com/K-H-Ismail/Dilated-Convolution-with-Learnable-Spacings-PyTorch (published in the ICML 2023 Workshop on Differentiable Almost Everything: Differentiable Relaxations, Algorithms, Operators, and Simulators).
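    A minimal sketch of the Gaussian-interpolation variant, under assumed names and a fixed standard deviation: instead of spreading each kernel element over its four nearest cells, its weight is spread over the whole kernel grid with a Gaussian centred on the learnable position, so longer-range gradients reach the positions.

```python
import torch

def gaussian_kernel_from_positions(weight, pos, kernel_size, sigma=0.5):
    """weight: (channels, n_elem); pos: (channels, n_elem, 2) continuous positions.
    Returns a dense (channels, kernel_size, kernel_size) kernel."""
    grid = torch.arange(kernel_size, dtype=weight.dtype)
    gx = torch.exp(-0.5 * ((grid - pos[..., 0:1]) / sigma) ** 2)  # (c, n, k)
    gy = torch.exp(-0.5 * ((grid - pos[..., 1:2]) / sigma) ** 2)  # (c, n, k)
    g = gx.unsqueeze(-1) * gy.unsqueeze(-2)                       # (c, n, k, k)
    g = g / g.sum(dim=(-2, -1), keepdim=True).clamp_min(1e-8)     # normalise each element
    return (weight[..., None, None] * g).sum(dim=1)               # (c, k, k)

w = torch.randn(8, 3, requires_grad=True)
p = torch.rand(8, 3, 2, requires_grad=True) * 6
k = gaussian_kernel_from_positions(w, p, kernel_size=7)
k.sum().backward()                         # positions receive non-zero gradients
print(p.grad.abs().sum() > 0)
```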

    Audio classification with Dilated Convolution with Learnable Spacings

    Dilated convolution with learnable spacings (DCLS) is a recent convolution method in which the positions of the kernel elements are learned throughout training by backpropagation. Its benefits have recently been demonstrated in computer vision (ImageNet classification and downstream tasks). Here we show that DCLS is also useful for audio tagging on the AudioSet classification benchmark. We took two state-of-the-art convolutional architectures using depthwise separable convolutions (DSC), ConvNeXt and ConvFormer, and a hybrid one that additionally uses attention, FastViT, and drop-in replaced all the DSC layers with DCLS ones. This significantly improved the mean average precision (mAP) of all three architectures without increasing the number of parameters and with only a low cost on throughput. The method code is based on PyTorch and is available at https://github.com/K-H-Ismail/DCLS-Audi
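    The drop-in replacement step can be illustrated as follows: walk the model and swap every depthwise Conv2d (groups equal to the channel count) for a layer produced by a user-supplied factory. In the paper that factory would build a DCLS convolution; the sketch keeps it generic and uses a plain larger depthwise convolution as a stand-in.

```python
import torch.nn as nn

def replace_depthwise_convs(model: nn.Module, make_layer):
    """Recursively replace every depthwise Conv2d with make_layer(old_conv)."""
    for name, child in model.named_children():
        if (isinstance(child, nn.Conv2d)
                and child.groups == child.in_channels
                and child.in_channels == child.out_channels):
            setattr(model, name, make_layer(child))
        else:
            replace_depthwise_convs(child, make_layer)
    return model

# Example: keep the channel count but enlarge the depthwise kernel.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),
    nn.Conv2d(32, 32, 7, padding=3, groups=32),   # depthwise: will be replaced
)
model = replace_depthwise_convs(
    model,
    make_layer=lambda old: nn.Conv2d(old.in_channels, old.out_channels,
                                     kernel_size=17, padding=8,
                                     groups=old.groups, bias=old.bias is not None),
)
print(model)
```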

    StereoSpike: Depth Learning with a Spiking Neural Network

    Depth estimation is an important computer vision task, useful in particular for navigation in autonomous vehicles or for object manipulation in robotics. Here we solved it using an end-to-end neuromorphic approach, combining two event-based cameras and a Spiking Neural Network (SNN) with a slightly modified U-Net-like encoder-decoder architecture, which we named StereoSpike. More specifically, we used the Multi Vehicle Stereo Event Camera Dataset (MVSEC). It provides a depth ground truth, which was used to train StereoSpike in a supervised manner using surrogate gradient descent. We propose a novel readout paradigm to obtain a dense analog prediction -- the depth of each pixel -- from the spikes of the decoder. We demonstrate that this architecture generalizes very well, even better than its non-spiking counterparts, leading to state-of-the-art test accuracy. To the best of our knowledge, this is the first time that such a large-scale regression problem has been solved by a fully spiking network. Finally, we show that low firing rates (<10%) can be obtained via regularization, with a minimal cost in accuracy. This means that StereoSpike could be efficiently implemented on neuromorphic chips, opening the door to low-power, real-time embedded systems.
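    The firing-rate regularization mentioned above can be sketched as a penalty on the mean spike rate of intermediate layers added to the depth-regression loss, so that sparser activity can be traded against accuracy through a single coefficient. This is an illustrative formulation, not the StereoSpike training code; the loss terms and names are assumptions.

```python
import torch

def stereospike_style_loss(depth_pred, depth_gt, spike_tensors, lam=1e-3):
    """depth_pred / depth_gt: (batch, H, W); spike_tensors: list of binary spike maps.

    Illustrative only: a mean-absolute depth error plus a firing-rate penalty
    weighted by `lam` (an assumed sparsity coefficient)."""
    regression = torch.mean(torch.abs(depth_pred - depth_gt))
    firing_rate = torch.stack([s.float().mean() for s in spike_tensors]).mean()
    return regression + lam * firing_rate, firing_rate

pred, gt = torch.rand(2, 64, 64), torch.rand(2, 64, 64)
spikes = [torch.randint(0, 2, (2, 32, 64, 64)) for _ in range(3)]
loss, rate = stereospike_style_loss(pred, gt, spikes)
print(float(loss), float(rate))
```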

    The wave of first spikes provides robust spatial cues for retinal information processing

    How a population of retinal ganglion cells (RGCs) encodes the visual scene remains an open question. Several coding strategies have been investigated, out of which two main views have emerged: considering RGCs as independent encoders, or as synergistic encoders, i.e. when the concerted spiking of an RGC population carries more information than the sum of the information contained in the spiking of individual RGCs. Although RGCs treated as independent encoders convey the main information, there is a growing body of evidence that considering RGCs as synergistic encoders provides complementary and more precise information. Based on salamander retina recordings, it has been suggested [11] that a code based on differential spike latencies between RGC pairs could be a powerful mechanism. Here, we tested this hypothesis in the mammalian retina. We recorded responses to stationary gratings from 469 RGCs in 5 mouse retinas. Interestingly, we did not find any RGC pairs exhibiting clear latency correlations (presumably due to the presence of spontaneous activity), showing that individual RGC pairs do not provide sufficient information in our conditions. However, considering the whole RGC population, we show that the shape of the wave of first spikes (WFS) successfully encodes spatial cues. To quantify its coding capabilities, we performed a discrimination task and showed that the WFS is more robust to spontaneous firing than the absolute latencies are. We also investigated the impact of a post-processing neural layer: the recorded spikes were fed into an artificial lateral geniculate nucleus (LGN) layer. We found that the WFS is not only preserved but even refined through the LGN-like layer, while classical independent coding strategies become impaired. These findings suggest that, even at the level of the retina, the WFS provides a reliable strategy to encode spatial cues.
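    The wave-of-first-spikes idea can be illustrated with a toy example: stimulus identity is read from the relative first-spike latencies across the population rather than from their absolute values, which makes the code insensitive to a global latency shift. The numbers below are synthetic and the decoder is omitted; only the robustness of the relative code is shown.

```python
import numpy as np

def first_spike_latencies(spike_times_per_cell):
    """spike_times_per_cell: list of arrays of spike times (ms), one per RGC."""
    return np.array([t.min() if len(t) else np.inf for t in spike_times_per_cell])

def wfs_code(latencies):
    """Relative wave shape: latencies referenced to the earliest spike."""
    return latencies - np.min(latencies)

rng = np.random.default_rng(0)
template = rng.uniform(20, 60, size=50)                 # latency pattern for one grating
trial = template + 7.0 + rng.normal(0, 1.0, size=50)    # global shift + jitter
print(np.abs(trial - template).mean())                  # absolute latencies differ by ~7 ms
print(np.abs(wfs_code(trial) - wfs_code(template)).mean())  # wave shape stays close
```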